Lab 14

Computer Architecture I @ ShanghaiTech University

Goals

Learn about basic OpenMP directives.
Write code to learn two ways of how #pragma omp for could be implemented. Learn about false sharing.
Learn about basic multiprocessing programming.

Multi-threading programming using OpenMP

OpenMP stands for Open specification for Multi-Processing. It is a framework that offers a C interface. It is not a built-in part of the language – most OpenMP features are directives to the compiler.

Benefits of multithreaded programming using OpenMP include:

Very simple interface allows a programmer to separate a program into serial regions and parallel regions.
Convenient synchronization control (Data race bugs in POSIX threads are very hard to trace).

In this lab, we will practice on basic usage of OpenMP.

Exercises

Download source code from files for Lab 14 first.

Exercise 1 - OpenMP Hello World

Consider the implementation of Hello World (hello.c):

int main ()

{

  #pragma omp parallel

  {

    int thread_ID = omp_get_thread_num ();

    printf ("hello world %d\n", thread_ID);

  }

  return 0;

}

This program will fork off the default number of threads and each thread will print out "hello world" in addition to which thread number it is. You can change the number of OpenMP threads by setting the environment variable OMP_NUM_THREADS or by using the omp_set_num_threads function in your program. The #pragma tells the compiler that the rest of the line is a directive, and in this case it is omp parallel. omp declares that it is for OpenMP and parallel says the following code block (what is contained in { }) can be executed in parallel. Give it a try:

$ make hello && ./hello

If you run ./hello a couple of times, you should see that the thread IDs are not always in numerical order and will most likely vary across runs. Think about the reason and explain to your TA.

It is also vital to note that the variable thread_ID is local to a specific thread and not shared across all threads. In general with OpenMP, variables declared inside the parallel block will be private to each thread, but variables declared outside will be global and accessible by all the threads.

Exercise 2 - Matrix Multiplication

Matrix multiplication is a common operation in scientific computing and machine learning. In this exercise, we will optimize a matrix multiplication implementation using OpenMP. The matrix multiplication is implemented in matmul.c.

Your task is to optimize matmul.c (speedup may plateau as the number of threads continues to increase). To aid you in this process, two useful OpenMP functions are:

Divide up the work for each thread through two different methods (write different code for each of these methods):

First task, slicing: have each thread handle adjacent rows: i.e. Thread 0 will compute the rows at indices i such that i % omp_get_num_threads() is 0, Thread 1 will compute the rows where i % omp_get_num_threads() is 1, etc.
Second task, chunking: if there are N threads, break the matrices into N contiguous chunks along the first dimension (the rows), and have each thread compute the product of the chunk of matrix A and the entire matrix B.

Hints:

Use the two functions we listed above somehow in the for loop to choose which rows each thread handles in the slicing method.
You may need a special case to prevent going out of bounds for matmul_optimized_chunks. Don't be afraid to write one.
Be careful about cache line alignment and false sharing. To avoid false sharing, each thread should have its own output buffer to store the computed rows.

For this exercise, we are asking you to manually split the work amongst threads since this is a common pattern used in software optimization. The designers of OpenMP actually made the #pragma omp parallel for directive to automatically split up independent work. Here is the function rewritten using it. You may NOT use this directive in your solution to this exercise.

void matmul (double *a, double *b, double *c)

{

  #pragma omp parallel for 

  for (int i = 0; i < MATRIX_SIZE; i++) {

    for (int j = 0; j < MATRIX_SIZE; j++) {

      for (int k = 0; k < MATRIX_SIZE; k++) {

        c[i * MATRIX_SIZE + j] += a[i * MATRIX_SIZE + k] * b[k * MATRIX_SIZE + j];

      }

    }

  }

}

Test the performance of your code with:

$ make matmul && ./matmul

Checkoff

Show the modified code for both the slicing and chunking methods in matmul.c.

Describe the performance differences between the methods you implemented and try to analyze the reason(Run more times to find a common pattern instead of just running once).

Explain why using OpenMP may not necessarily lead to optimal performance on a single compute node with multiple cores.

Exercise 3 - Pi Calculation

In this exercise, we'll use OpenMP to calculate π (Pi) using numerical integration. The formula we'll use is:

π = ∫₀¹ (4/(1+x²)) dx

This can be approximated using a Riemann sum. At first glance, implementing this might seem straightforward, but the challenge is how to sum up all the partial results into the same variable (reduction). A sloppy handling of reduction may lead to data races: all the threads are trying to read and write to the same address simultaneously. One solution is to use a critical section. The code in a critical section can only be executed by a single thread at any given time. Thus, having a critical section naturally prevents multiple threads from reading and writing to the same data, a problem that would otherwise lead to data races. One way to avoid data races is to use the critical primitive provided by OpenMP. An implementation, pi_omp_naive in pi.c, protects the sum with a critical section.

Try out the code:

$ make pi && ./pi

Notice how the performance gets much worse as the number of threads goes up. By putting all the work of reduction in a critical section, we have flattened the parallelism and made it so only one thread can do useful work at a time (not exactly the idea behind thread-level parallelism). This contention is problematic; each thread is constantly fighting for the critical section and only one is making any progress at any given time. As the number of threads goes up, so does the contention, and the performance pays the price. Can we reduce the number of times that each thread needs to use a critical section?

In this exercise, you have 2 tasks:

Fix the performance problem without using OpenMP's built-in Reduction keyword.
Fix the performance problem using OpenMP's built-in Reduction keyword. (Note that your code should no longer contain #pragma omp critical)

Checkoff

Show the TA your manual fix to pi.c that gets speedup over the single threaded case.

Show the TA your Reduction keyword fix for pi.c, and explain the difference in performance between the two approaches.

Discuss why you see diminishing returns (or even performance degradation) as you increase the number of threads beyond a certain point.

Exercise 4 - Merge Sort Using OpenMP Tasks

In previous exercises, we focused on data parallelism with work-sharing constructs like #pragma omp for. However, not all algorithms can be easily expressed as simple loops. Recursive divide-and-conquer algorithms like merge sort present a different challenge for parallelization. OpenMP offers the task parallelism model to handle these cases.

OpenMP tasks allow you to dynamically create units of work that can be executed by any thread in the parallel region. This is particularly useful for recursive algorithms or irregular parallelism patterns. In this exercise, you will parallelize a merge sort algorithm using OpenMP tasks.

Understanding OpenMP Tasks

Here's how tasks work in OpenMP. For more details, it is recommended to refer to this introduction to OpenMP Tasking fundamentals.

A task represents an independent unit of work.
Tasks are created by threads and placed in a task pool.
Any thread can execute any task from the pool (this is different from work-sharing).
Tasks can create additional tasks (useful for recursive algorithms).
The #pragma omp taskwait directive synchronizes by waiting for all child tasks to complete.

The file mergesort.c contains a serial implementation of merge sort. Your job is to parallelize it using OpenMP tasks. There are three parts to complete:

Create a parallel region with a single thread acting as the task generator.
Implement a cutoff threshold to avoid creating too many small tasks.
Use tasks to recursively sort the two halves of the array in parallel.

Implementation Guide

Part 1: Create the initial parallel region:

Use #pragma omp parallel to create a team of threads.
Use #pragma omp single to ensure only one thread initiates the recursive task creation.
Call your recursive merge sort function from this single thread.

Part 2: Implement a cutoff for small arrays:

For small arrays (defined by CUT_OFF), use the serial merge sort directly.
This reduces the overhead of creating too many small tasks.

Part 3: Use tasks for recursive sorting:

For arrays larger than the cutoff, split the array into two halves.
Use #pragma omp task to create tasks for sorting each half.
Use #pragma omp taskwait to wait for both halves to be sorted before merging.
Call the merge operation after both sorting tasks are complete.

Try implementing this and compare the performance with the serial version.

$ make mergesort &&./mergesort

Checkoff

Show the TA your implementation of merge_sort_omp_tasks and merge_sort_omp_tasks_recursive.

Explain the importance of the taskwait directive in your implementation. What would happen if you omitted it?

Discuss the tradeoffs in choosing different CUT_OFF values. How does this value affect performance?

The following TA(s) are responsible for this lab: Zhixin Xiao <xiaozhx2025@shanghaitech.edu.cn>